Abstract

The problem of distributed or decentralized detection and estimation in applications
such as wireless sensor networks has often been considered in the framework
of parametric models, in which strong assumptions are made about a statistical 
description of nature. In certain applications, such assumptions are warranted and 
systems designed from these models show promise. However, in other scenarios, 
prior knowledge is at best vague and translating such knowledge into a statistical 
model is undesirable. Applications such as these pave the way for a nonparametric 
study of distributed detection and estimation. In this paper, we review recent work 
of the authors in which some elementary models for distributed learning are considered.
These models are in the spirit of classical work in nonparametric statistics
and are applicable to wireless sensor networks. 


1 Introduction 

Wireless sensor networks have attracted considerable attention in recent years [?].
Research in this area has focused on two separate aspects of such networks: networking
issues, such as capacity, delay, and routing strategies; and applications issues. This
paper is concerned with the second of these aspects of wireless sensor networks, and in
particular with the problem of distributed inference. Wireless sensor networks are a
fortiori designed for the purpose of making inferences about the environments that they are
sensing, and they are typically characterized by limited communications capabilities due
to tight energy and bandwidth limitations. Thus, distributed inference is a major issue
in the study of such networks. Distributed detection and estimation is a well-developed
field with a rich history. Much of the work in this area has focused on either parametric
problems, in which strong statistical assumptions are made [?],
or on traditional nonparametric formalisms, such as constant-false-alarm-rate detection
[2]. In this paper, we consider an alternative nonparametric approach to distributed
inference that is relevant to wireless sensor networks, namely, distributed learning under
communications constraints.
This research was supported in part by Draper Laboratory under Grant IR&D 6002, in part by the National Science Foundation
under Grants CCR-0020524 and CCR-0312413, and in part by the Office of Naval Research under Grant
No. N00014-03-1-0102.



Although [23] advocated a learning theory approach to sensor networks, [?] is the
first work to consider the classical model for decentralized detection in a nonparametric
setting. In the context of kernel methods commonly used in machine learning, the notion
of a marginalized kernel is introduced in [?] to derive an efficient algorithm for designing
a decentralized detection system based on a collection of training data.

A related area of research lies in the study of ensemble methods in machine learning; 
examples of these techniques include bagging, boosting, mixtures of experts, and others 
[?]. These techniques are similar to the problem of interest here in that they
aggregate many individually trained classifiers. However, the focus of these works is on 
the statistical and algorithmic advantages of learning with an ensemble and not on the 
nature of learning under communication constraints. Notably, [?] considered an early
model for learning with many individually trained hypotheses. 

The models considered in this paper have been studied in detail in [?]. Here,
we focus on the models and the main results and refer the reader to [?] for a more
thorough discussion and proofs of these results. In particular, there is extensive work in 
nonparametric statistics that is closely related to the models considered in this paper; 
we touch on these connections throughout, but leave a more complete review and list of 
references to the full papers. 

In Section 2, we review the classical model for learning in nonparametric statistics and 
explain where our work departs from the classical model. In Section 3, we discuss a model for
distributed learning with distributed data under a family of simple communication models
[?]; we discuss our main results and connect them to related work in nonparametrics. In
Section 4, we discuss a model for distributed learning with specialists [?]. Similarly,
we connect the main result with known results in nonparametrics. Finally, we end with 
conclusions in Section 5. 


2 The Classical Learning Model & Our Departure 

Let us briefly review a standard model for learning in nonparametric statistics. For 
a thorough introduction to nonparametric statistics and classical centralized learning 
models, we refer the reader to [?].

Let $X$ and $Y$ be $\mathcal{X}$-valued and $\mathcal{Y}$-valued random variables, respectively, with a joint
distribution denoted by $P_{XY}$. $\mathcal{X}$ is known as the feature, input, or observation space; $\mathcal{Y}$
is known as the label or output space. Throughout, we will take $\mathcal{X} \subseteq \mathbb{R}^d$ and consider
two cases corresponding to binary classification ($\mathcal{Y} = \{0,1\}$) and regression estimation
($\mathcal{Y} = \mathbb{R}$). Of course, the decision-theoretic problem is to predict $Y$, given an observation
$X$.

In parametric settings, one assumes prior knowledge of the distribution $P_{XY}$. Defining
a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, one designs a decision rule that achieves the minimal
expected loss $L^* = \inf_g \mathbf{E}\{\ell(g(X), Y)\}$. In the binary classification setting, the criterion
of interest is the probability of misclassification; we let $\ell(y, y') = \mathbf{1}_{\{y \neq y'\}}$, the well-known
zero-one loss. The structure of the risk-minimizing decision rule is well understood [2];
let $g_B : \mathcal{X} \to \{0,1\}$ denote this Bayes decision rule.

In regression settings, we consider the squared error criterion; we let $\ell(y, y') = |y - y'|^2$.
It is well known that the regression function $\eta(x) = \mathbf{E}\{Y \mid X = x\}$ achieves the minimal
expected loss.
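
As a small illustration of the two criteria, the sketch below computes the Bayes decision rule, its probability of misclassification, and the squared-error risk of the regression function for a toy discrete distribution. The distribution and the names `p_x`, `eta`, `bayes_rule`, `zero_one_risk`, and `squared_error_risk` are assumptions made for this example only, as is the convention of predicting 1 when the posterior probability equals 1/2.

```python
# Toy distribution, assumed purely for illustration.
p_x = {0: 0.5, 1: 0.3, 2: 0.2}   # P{X = x}
eta = {0: 0.1, 1: 0.6, 2: 0.9}   # eta(x) = E{Y | X = x} = P{Y = 1 | X = x}

def bayes_rule(x):
    """Predict 1 exactly when label 1 is at least as likely as label 0."""
    return 1 if eta[x] >= 0.5 else 0

def zero_one_risk(rule):
    """Probability of misclassification P{rule(X) != Y} under the toy model."""
    risk = 0.0
    for x, px in p_x.items():
        # Given X = x, the rule errs with probability 1 - eta(x) if it
        # predicts 1 and with probability eta(x) if it predicts 0.
        risk += px * ((1.0 - eta[x]) if rule(x) == 1 else eta[x])
    return risk

def squared_error_risk(predict):
    """E{|predict(X) - Y|^2} for a {0,1}-valued label Y."""
    risk = 0.0
    for x, px in p_x.items():
        f = predict(x)
        risk += px * (eta[x] * (f - 1.0) ** 2 + (1.0 - eta[x]) * f ** 2)
    return risk

bayes_risk = zero_one_risk(bayes_rule)                  # L* for the zero-one loss
regression_risk = squared_error_risk(lambda x: eta[x])  # L* for the squared error
```

In this toy model every other classifier on the three points has a zero-one risk at least as large as `bayes_risk`, and any constant predictor has a squared-error risk at least as large as `regression_risk`.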

In nonparametric settings, prior knowledge of the distribution $P_{XY}$ is not available
and thus, computing the Bayes rule or regression function is not possible. Hope is not
lost, for we are provided $D_n = \{(X_i, Y_i)\}_{i=1}^n$, an independent and identically distributed
(i.i.d.) collection of training data with $(X_i, Y_i) \sim P_{XY}$ for all $i \in \{1, \ldots, n\}$. The learning
problem is to use this data to infer decision rules with small loss. That is, our decision
rules are independent of $P_{XY}$, but can depend on the labeled examples in $D_n$; i.e.,
$g(X) = g(X, D_n)$.

In this work, we focus on the information-theoretic property known as universal
consistency [?]. Though a thorough discussion of this property is beyond the scope of
this paper, we will state the definition for the sake of continuity.

Definition 1 Let $L_n = \mathbf{E}\{\ell(g_n(X, D_n), Y) \mid D_n\}$. $\{g_n\}_{n=1}^\infty$ is said to be universally
consistent if $\mathbf{E}\{L_n\} \to L^*$ for all distributions $P_{XY}$.

Consistent with convention, we use $g_n(x) = g_n(x, D_n)$ to denote decision rules in the
binary classification setting and we use $\hat{\eta}_n(x) = \hat{\eta}_n(x, D_n)$ to denote decision rules in the
regression setting.

The existence of universally consistent classifiers and estimators was an open question
until Stone’s Theorem [22] demonstrated that a wide range of classifiers and estimators
had this fundamental property; rules in this class are known as weighted-average rules and
include histogram estimators, nearest-neighbor rules, and classical kernel rules. Extensive
work in nonparametrics has extended this result to consider the consistency of Stone-type
rules under various sampling processes; see, for example, [?] and references
therein. These models focus on various dependency structures within the training data 
and assume that a single processor has access to the entire data stream. 

In distributed scenarios like sensor networks, different sensors have access to different 
data streams that differ in distribution and may depend on external parameters such as 
the state of a sensor network or location of a database. Moreover, sensors are unable to 
share all of their data with each other or with a central fusion center, as they may have 
only a few bits with which to communicate a summary. The nature of the work considered 
in this paper is to consider questions of universal consistency similar to those above but 
in this distributed environment. For a given model of communication amongst sensors, 
each of which has been allocated a small portion of a larger learning problem, can enough
information be exchanged to allow for a universally consistent network? We consider
several models that differ both in the way the learning problem is distributed amongst 
sensors and in the nature of the communication constraints. These models more closely 
resemble a distributed environment and present new questions to consider with regard 
to universal consistency. Insofar as these models present a useful picture of distributed 
scenarios, this paper addresses the issue of whether or not the guarantees provided by 
Stone’s Theorem in centralized environments hold in distributed settings. Notably, the 
models under consideration will be similar in spirit to their classical counterparts; indeed, 
similar techniques can be applied to prove results. 


3 Learning with Distributed Data 

3.1 The Model 

In this section, we present a model where the learning problem is divided amongst sensors 
by distributing examples from an i.i.d. training set amongst the sensors. In the classical 
setting, the training data Dn is provided to a single, centralized learning agent. Instead, 
suppose that for each $i \in \{1, \ldots, n\}$, the training datum $(X_i, Y_i)$ is received by a distinct
member of a network of $n$ sensors. When the fusion center observes a new observation
$X \sim P_X$, it broadcasts the observation to the network in a request for information. At
this time, each sensor can respond with at most one bit. That is, each sensor chooses 
whether or not to respond to the fusion center’s request for information; if it chooses to 
respond, a sensor sends either a 1 or a 0 based on its local decision algorithm. Upon 
observing the response of the network, the fusion center combines the information to 
create an estimate of Y. As before, the key question is: do there exist sensor decision 
rules and a fusion rule that result in a universally consistent network in the limit as the 
number of sensors increases without bound? 

In Sections 3.2 and 3.4, we answer the question in the affirmative in both the binary
classification and regression frameworks. In each framework, we demonstrate sensor
decision rules and fusion rules that are universally consistent and connect the results to 
known work in nonparametrics. 

In this model, each sensor’s decision rule can be viewed as a selection of one of three
states: abstain, vote and send 1, and vote and send 0. The option to abstain essentially
allows the sensors to convey slightly more information than the one bit that is assumed to
be physically transmitted to the fusion center. With this observation, these results can be
interpreted as follows: $\log_2(3)$ bits per sensor per classification is sufficient for universal
consistency to hold for both distributed classification and regression with abstention.

In this view, it is natural to ask whether these $\log_2(3)$ bits are necessary. Can
consistency results be proven at lower bit rates? Consider a revised model, precisely the
same as above, except that in response to the fusion center’s request for information,
each sensor must respond with 1 or 0; abstention is not an option and thus, each sensor
responds with exactly one bit per classification. The same questions arise: are there rules
for which universal consistency results hold in distributed classification and regression
without abstention?

In Sections 3.3 and 3.5, we study distributed classification and regression in the
communication model without abstention. We observe that universally consistent networks can be
designed in the classification regime; through a negative result, we observe that universal
consistency is not achievable in the regression framework.


3.2 Distributed Classification with Abstention 

In this section, we show that the universal consistency of distributed classification with
abstention follows immediately from Stone’s Theorem and the classical analysis of naive
kernel classifiers. Recall that $\mathcal{Y} = \{0,1\}$ and that, for each $i \in \{1, \ldots, n\}$, the training datum
$(X_i, Y_i) \in D_n$ is received by a distinct member of a network of $n$ sensors.

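To answer the question of whether a universally consistent network can be devised,
let us construct one natural choice. Here and below, for a radius $r_n > 0$,
$B_{r_n}(x) = \{x' \in \mathbb{R}^d : \|x - x'\|_2 \le r_n\}$ denotes the Euclidean ball of radius $r_n$ centered at $x$. Let

$$\delta_{ni}(x) = \begin{cases} Y_i, & \text{if } X_i \in B_{r_n}(x) \\ \text{abstain}, & \text{otherwise} \end{cases} \qquad (1)$$

and

$$g_n(x) = \begin{cases} 1, & \text{if } \dfrac{\sum_{i=1}^n \delta_{ni}(x)\, \mathbf{1}_{\{\delta_{ni}(x) \neq \text{abstain}\}}}{\sum_{i=1}^n \mathbf{1}_{\{\delta_{ni}(x) \neq \text{abstain}\}}} \ge \dfrac{1}{2} \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
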


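so that $g_n(x)$ amounts to a majority vote fusion rule. With this choice, it is straightforward
to see that the net decision rule is equivalent to the plug-in kernel classifier rule
with the naive kernel. Indeed,

$$g_n(x) = \begin{cases} 1, & \text{if } \dfrac{\sum_{i=1}^n Y_i\, \mathbf{1}_{\{X_i \in B_{r_n}(x)\}}}{\sum_{i=1}^n \mathbf{1}_{\{X_i \in B_{r_n}(x)\}}} \ge \dfrac{1}{2} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
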


With this equivalence, the universal consistency of the network follows from Stone’s Theorem
applied to naive kernel classifiers. With $L_n = \mathbf{P}\{g_n(X) \neq Y \mid D_n\}$, the probability
of error of the network conditioned on the random training data, we state this known
result without proof as Theorem 1.


Theorem 1 ([6]) If, as $n \to \infty$, $r_n \to 0$ and $(r_n)^d n \to \infty$, then $\mathbf{E}\{L_n\} \to L^*$ for all
distributions $P_{XY}$.
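
The equivalence between the network with abstention and the centralized naive kernel rule can be checked numerically. The following is a minimal sketch, not the authors' implementation: the toy distribution, the radius schedule `r_n = n ** (-1 / (2 * d))`, the tie-breaking toward label 1, the convention of predicting 0 when every sensor abstains, and all function names are assumptions made for illustration.

```python
import math
import random

def sensor_vote(x, x_i, y_i, r_n):
    """Sensor rule: send the local label if the local feature lies within
    distance r_n of the query x; otherwise abstain (encoded as None)."""
    return y_i if math.dist(x, x_i) <= r_n else None

def fuse_majority(votes):
    """Fusion rule: majority vote among the sensors that did not abstain.
    Predicting 0 when every sensor abstains is an assumed convention."""
    cast = [v for v in votes if v is not None]
    if not cast:
        return 0
    return 1 if sum(cast) >= 0.5 * len(cast) else 0

def network_predict(x, data, r_n):
    return fuse_majority([sensor_vote(x, x_i, y_i, r_n) for x_i, y_i in data])

def naive_kernel_predict(x, data, r_n):
    """Centralized plug-in rule with the naive kernel, for comparison."""
    near = [y_i for x_i, y_i in data if math.dist(x, x_i) <= r_n]
    if not near:
        return 0
    return 1 if sum(near) / len(near) >= 0.5 else 0

rng = random.Random(0)
d, n = 2, 2000
r_n = n ** (-1.0 / (2 * d))   # r_n -> 0 while (r_n)^d * n = sqrt(n) -> infinity

def sample():
    """Toy distribution (assumed): X uniform on [0,1]^2, noisy label."""
    x = (rng.random(), rng.random())
    p1 = 0.9 if x[0] > 0.5 else 0.1    # P{Y = 1 | X = x}; Bayes risk is 0.1
    return x, (1 if rng.random() < p1 else 0)

data = [sample() for _ in range(n)]        # one labeled example per sensor
queries = [sample() for _ in range(300)]   # fresh test observations
predictions = [network_predict(x, data, r_n) for x, _ in queries]
same_as_kernel = all(
    p == naive_kernel_predict(x, data, r_n)
    for p, (x, _) in zip(predictions, queries)
)
error_rate = sum(p != y for p, (_, y) in zip(predictions, queries)) / len(queries)
```

Here `same_as_kernel` records whether the distributed decision coincides with the centralized one on every query, and `error_rate` is an empirical estimate of the network's probability of error on this toy problem.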


3.3 Distributed Classification without Abstention 


As noted in Section 3.1, given the results of the last section, it is natural to ask whether 
the communication constraints can be tightened. Let us consider the second communication
model in which the sensors cannot choose to abstain. In effect, each sensor
communicates one bit per decision. Recall that $\mathcal{Y} = \{0,1\}$; we again consider whether
universally Bayes-risk consistent schemes exist for the network.

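Let $\{Z_{ni\frac{1}{2}}\}_{i=1}^n$ be a family of $\{0,1\}$-valued random variables such that $\mathbf{P}\{Z_{ni\frac{1}{2}} = 1\} = \frac{1}{2}$.
Consider the randomized sensor decision rule specified as follows:

$$\delta_{ni}(x) = \begin{cases} Y_i, & \text{if } X_i \in B_{r_n}(x) \\ Z_{ni\frac{1}{2}}, & \text{otherwise} \end{cases} \qquad (4)$$

That is, the sensors respond according to their training data if $x$ is sufficiently close to
$X_i$. Else, they simply “guess”, flipping an unbiased coin.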

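A natural fusion rule is the majority vote:

$$g_n(x) = \begin{cases} 1, & \text{if } \dfrac{1}{n}\sum_{i=1}^n \delta_{ni}(x) \ge \dfrac{1}{2} \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$
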


Modifying our convention slightly, let $D_n = \{(X_i, Y_i, Z_{ni\frac{1}{2}})\}_{i=1}^n$. Define

$$L_n = \mathbf{P}\{g_n(X) \neq Y \mid D_n\}. \qquad (6)$$

That is, $L_n$ is the conditional probability of error of the majority vote fusion rule
conditioned on the randomness in sensor training and sensor decision rules. Assuming a
network using the described decision rules, Proposition 1 specifies sufficient conditions
for consistency.


Proposition 1 If, as $n \to \infty$, $r_n \to 0$ and $(r_n)^d \sqrt{n} \to \infty$, then $\mathbf{E}\{L_n\} \to L^*$ for all
distributions $P_{XY}$.


Yet again, the conditions of the proposition strike a similarity with consistency results
for kernel classifiers using the naive kernel. Indeed, $r_n \to 0$ ensures that the bias of the
classifier decays to zero. However, $r_n$ must not decay too rapidly. As the number of
sensors in the network grows large, many, indeed most, of the sensors will be “guessing”
for any given prediction; in general, only a decaying fraction of the sensors will respond
with useful information. In order to ensure that these informative bits can be heard
through the noise introduced by the guessing sensors, we require $(r_n)^d \sqrt{n} \to \infty$. Note the difference
from the result for naive kernel classifiers, where $(r_n)^d n \to \infty$ dictates a sufficient rate
of convergence for $r_n$.

To prove this result, we show directly that the expected probability of misclassification 
converges to the Bayes rate. This is unlike techniques commonly used to demonstrate the 
consistency of kernel classifiers, etc., which are so-called “plug-in” classification rules. In 
those settings, it suffices to show that the rules are based on consistent estimates of the 
a posteriori probabilities $\mathbf{P}\{Y = i \mid X\}$, $i \in \{0,1\}$. However, for this model, we cannot
estimate the a posteriori probabilities directly; the proof resorts to margin-based analysis
[20]. These comments foreshadow the negative result of Section 3.5.
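
To see how the informative bits can still be heard through the guessing sensors, the sketch below simulates the one-bit model for $d = 1$. It is an illustration under stated assumptions rather than the construction analyzed in the proof: the toy distribution, the schedule `r_n = n ** (-1 / (4 * d))`, tie-breaking toward label 1, and all names are hypothetical.

```python
import random

rng = random.Random(1)

def sensor_bit(x, x_i, y_i, r_n):
    """Send the local label if the local feature is within r_n of the query;
    otherwise guess with a fair coin (abstaining is not allowed)."""
    if abs(x - x_i) <= r_n:
        return y_i
    return 1 if rng.random() < 0.5 else 0

def fuse_majority(bits):
    """Predict 1 when at least half of all n received bits are 1."""
    return 1 if sum(bits) >= 0.5 * len(bits) else 0

def sample():
    """Toy distribution (assumed): X uniform on [0,1], noisy label."""
    x = rng.random()
    p1 = 0.9 if x > 0.5 else 0.1    # P{Y = 1 | X = x}; Bayes risk is 0.1
    return x, (1 if rng.random() < p1 else 0)

d, n = 1, 4000
r_n = n ** (-1.0 / (4 * d))   # r_n -> 0 while (r_n)^d * sqrt(n) = n^(1/4) -> infinity

data = [sample() for _ in range(n)]
queries = [sample() for _ in range(200)]

errors = 0
informative = 0
for x, y in queries:
    bits = [sensor_bit(x, x_i, y_i, r_n) for x_i, y_i in data]
    informative += sum(abs(x - x_i) <= r_n for x_i, _ in data)
    errors += fuse_majority(bits) != y
error_rate = errors / len(queries)
informative_fraction = informative / (n * len(queries))
```

In this run `informative_fraction` is the share of sensors that answer from their training datum rather than from a coin flip; it shrinks as `r_n` decreases, which is why the radius must decay more slowly here than in the model with abstention.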

3.4 Distributed Regression with Abstention 

Let us now move to a regression setting in which we focus on estimating a real-valued 
function in a bandwidth-starved environment. The model remains the same except that
$\mathcal{Y} = \mathbb{R}$; that is, $Y$ is an $\mathbb{R}$-valued random variable and likewise, sensors receive real-valued
training data labels, Yi. As in Section 3.2, for each prediction, sensors are allowed to 
transmit one bit of information and they have the ability to abstain. To demonstrate 
that consistency can be achieved, let us devise candidate rules. 

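For each integer $n$, let $\{Z_{n\theta}\}_{\theta \in [0,1]}$ be a family of $\{0,1\}$-valued random
variables parameterized by $[0,1]$ such that for each $\theta \in [0,1]$, $Z_{n\theta}$ is Bernoulli with parameter
$\theta$.

Let $\{c_n\}_{n=1}^\infty$ and $\{r_n\}_{n=1}^\infty$ be arbitrary sequences of real numbers such that $c_n \to \infty$
and $r_n \to 0$ as $n \to \infty$. Let the local sensor decision algorithm $\delta_{ni}(x)$ be defined as:

$$\delta_{ni}(x) = \begin{cases} Z_{n,\, \frac{1}{2}\frac{Y_i}{c_n} + \frac{1}{2}}, & \text{if } x \in B_{r_n}(X_i) \text{ and } |Y_i| \le c_n \\ Z_{n,\, \frac{1}{2}\operatorname{sgn}(Y_i) + \frac{1}{2}}, & \text{if } x \in B_{r_n}(X_i) \text{ and } |Y_i| > c_n \\ \text{abstain}, & \text{otherwise} \end{cases} \qquad (7)$$
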


for i = 1,..., n. In words, the sensors choose to vote if Xi is close enough to X; to vote, 
they flip a biased coin, with the bias determined by Yi and the size of the network, n. 
Let us define the fusion rule: 


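$$\hat{\eta}_n(x) = 2 c_n \left( \frac{\sum_{i=1}^n \delta_{ni}(x)\, \mathbf{1}_{\{\delta_{ni}(x) \neq \text{abstain}\}}}{\sum_{i=1}^n \mathbf{1}_{\{\delta_{ni}(x) \neq \text{abstain}\}}} - \frac{1}{2} \right) \qquad (8)$$
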


In words, the fusion rule shifts and scales the average vote. 

Define $L_n = \mathbf{E}\{|\hat{\eta}_n(X) - Y|^2 \mid D_n\}$ with the expectation taken over $X$, $D_n = \{(X_i, Y_i)\}_{i=1}^n$,
and the randomness introduced in the sensor decision rules. Assuming a network using
the described decision rules, Proposition 2 specifies sufficient conditions for consistency.

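Proposition 2 Suppose $P_{XY}$ is such that $P_X$ is compactly supported and $\mathbf{E}\{Y^2\} < \infty$.
If, as $n \to \infty$,

1. $c_n \to \infty$,

2. $r_n \to 0$, and

3. $c_n^2 / \big((r_n)^d n\big) \to 0$,

then $\mathbf{E}\{L_n\} \to L^*$.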




Those familiar with the classical statistical pattern recognition literature will find the
style of proof very familiar; special care must be taken to demonstrate that the variance
of the estimate does not decrease too slowly compared to $\{c_n\}_{n=1}^\infty$ and to show that
the bias introduced by the “clipped” sensor decision rules converges to zero. Note that
the divergent scaling sequence $\{c_n\}_{n=1}^\infty$ is required for the general case when there is no
reason to assume that $Y$ has a known bound; in general, any decision rule which obeys
the communication constraints will require a scaling sequence “like” $\{c_n\}_{n=1}^\infty$. If, instead,
$|Y| \le B$ a.s. for some known $B > 0$, it suffices to let $c_n = B$ for all $n$. More generally,
the constraint regarding the compactness of $P_X$ can be weakened. For a more detailed
discussion, we refer the reader to [?].
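
The mechanism of encoding a real-valued label in the bias of a single coin flip, and of undoing that encoding at the fusion center by shifting and scaling the average vote, can be sketched as follows for $d = 1$. This is only an illustration: the toy regression function, the schedules for `c_n` and `r_n`, the clipping of labels whose magnitude exceeds `c_n`, the value returned when every sensor abstains, and all names are assumptions of this example.

```python
import math
import random

rng = random.Random(3)

def clip(y, c):
    """Truncate y to the interval [-c, c]."""
    return max(-c, min(c, y))

def sensor_vote(x, x_i, y_i, r_n, c_n):
    """Abstain (None) unless the local feature is within r_n of the query.
    Otherwise send a Bernoulli bit whose mean encodes the label: labels in
    [-c_n, c_n] map linearly to a bias in [0, 1]. Clipping labels outside
    that range is an assumption of this sketch."""
    if abs(x - x_i) > r_n:
        return None
    theta = 0.5 + clip(y_i, c_n) / (2.0 * c_n)
    return 1 if rng.random() < theta else 0

def fuse(votes, c_n):
    """Shift and scale the average of the bits that were actually sent.
    Returning 0.0 when every sensor abstains is an assumed convention."""
    cast = [v for v in votes if v is not None]
    if not cast:
        return 0.0
    return 2.0 * c_n * (sum(cast) / len(cast) - 0.5)

def eta(x):
    """Toy regression function (assumed)."""
    return math.sin(2.0 * math.pi * x)

n = 10000
c_n = math.sqrt(math.log(n))    # grows without bound, but slowly
r_n = n ** (-1.0 / 3.0)         # shrinks slowly enough that many sensors vote

data = []
for _ in range(n):
    x_i = rng.random()
    data.append((x_i, eta(x_i) + rng.uniform(-0.5, 0.5)))

queries = [rng.random() for _ in range(40)]
sq_err = 0.0
for x in queries:
    votes = [sensor_vote(x, x_i, y_i, r_n, c_n) for x_i, y_i in data]
    sq_err += (fuse(votes, c_n) - eta(x)) ** 2
mean_sq_err = sq_err / len(queries)   # error against the true regression function
```

Because the expected value of each transmitted bit is $\frac{1}{2} + Y_i / (2 c_n)$ for unclipped labels, the shifted and scaled average is a local average of the labels corrupted by coin-flip noise whose variance grows with `c_n`, which is the trade-off discussed above.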

3.5 Distributed Regression without Abstention 

Finally, let us consider the communication model from Section 3.3 in the regression 
setting. Now, $\mathcal{Y} = \mathbb{R}$; sensors will receive real-valued training data labels $Y_i$. When
asked to respond with information, they will reply with either 0 or 1. We will argue that
universal consistency is not achievable in this one-bit regime.

Let $\mathcal{A} = \{a : \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R} \to [0,1]\}$. That is, $\mathcal{A}$ is the collection of functions
mapping $\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}$ to $[0,1]$. For every sequence of functions $\{a_n\}_{n=1}^\infty \subseteq \mathcal{A}$, there is a
corresponding sequence of randomized sensor decision rules $\{\delta_{ni}\}$ specified by

$$\delta_{ni}(x) = Z_{n\, a_n(x, X_i, Y_i)} \qquad (9)$$

for $i \in \{1, \ldots, n\}$. Let us consider the set of sensor decision rules so specified. Note that
as before, these sensor decision rules are allowed to depend on $n$ and satisfy the same
constraints imposed on the decision rules in the classification framework of Section 3.3.

A fusion rule consists of a sequence of functions $\{\hat{\eta}_n\}_{n=1}^\infty$ mapping $\mathbb{R}^d \times \{0,1\}^n$ to
$\mathcal{Y} = \mathbb{R}$. To proceed, we require some regularity on $\{\hat{\eta}_n\}_{n=1}^\infty$. We impose two natural
constraints: (i) the fusion rule must be permutation invariant to the sequence of bits sent
from the $n$ sensors and (ii) the fusion rule should be Lipschitz in the average Hamming
distance, i.e., there exists some constant $C$ such that

$$|\hat{\eta}_n(x, b_1) - \hat{\eta}_n(x, b_2)| \le C\, \frac{1}{n} \sum_{i=1}^n |b_{1i} - b_{2i}| \qquad (10)$$

for all bit strings $b_1, b_2 \in \{0,1\}^n$, all $x \in \mathbb{R}^d$, and every $n$.

As usual, we will consider $L_n = \mathbf{E}\{|\hat{\eta}_n(X) - Y|^2 \mid D_n\}$ as the performance metric; here,
the expectation is taken over X and any randomness introduced in the sensor decision 
rules themselves. The main result is as follows. 

Proposition 3 For every sequence of sensor decision rules $\{\delta_{ni}\}$ specified according
to (9) with a pointwise converging sequence of functions $\{a_n\}_{n=1}^\infty \subseteq \mathcal{A}$, there is no
permutation invariant fusion rule $\{\hat{\eta}_n\}_{n=1}^\infty$ satisfying (10) such that

$$\lim_{n \to \infty} \mathbf{E}\{L_n\} = L^* \qquad (11)$$

universally.

The proof in [?] proceeds by using $\{a_n\}_{n=1}^\infty$ to specify two pairs of random variables $(X, Y)$
and $(X', Y')$ with $\eta(x) = \mathbf{E}\{Y \mid X = x\} \neq \mathbf{E}\{Y' \mid X' = x\} = \eta'(x)$. Asymptotically, however,
the fusion center’s estimate will be indifferent to whether the sensors are trained
with random data distributed according to $P_{XY}$ or $P_{X'Y'}$. This observation will
contradict universal consistency and complete the proof.
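
The two regularity constraints placed on the fusion rule are easy to check for a concrete candidate. The sketch below tests them numerically for a fusion rule that shifts and scales the average bit; this candidate, its scale `c`, the constant `lipschitz_c = 2 * c`, and the function names are assumptions chosen for illustration, and the rule ignores the observation `x` for simplicity. The check says nothing about consistency itself, only that such a rule falls within the class covered by the negative result.

```python
import random

def fuse_mean(bits, c=1.0):
    """Example fusion rule (assumed): shift and scale the average bit."""
    return 2.0 * c * (sum(bits) / len(bits) - 0.5)

def avg_hamming(b1, b2):
    """Average Hamming distance (1/n) * sum_i |b1_i - b2_i|."""
    return sum(abs(u - v) for u, v in zip(b1, b2)) / len(b1)

rng = random.Random(2)
n, c = 64, 3.0
lipschitz_c = 2.0 * c    # candidate Lipschitz constant for fuse_mean

permutation_invariant = True
lipschitz = True
for _ in range(1000):
    b1 = [rng.randint(0, 1) for _ in range(n)]
    b2 = [rng.randint(0, 1) for _ in range(n)]
    shuffled = b1[:]
    rng.shuffle(shuffled)
    # (i) reordering the sensors' bits must not change the estimate
    permutation_invariant &= fuse_mean(shuffled, c) == fuse_mean(b1, c)
    # (ii) the estimate moves by at most C times the average Hamming distance
    gap = abs(fuse_mean(b1, c) - fuse_mean(b2, c))
    lipschitz &= gap <= lipschitz_c * avg_hamming(b1, b2) + 1e-12
```

Both flags remain true for this candidate because the rule depends on the bits only through their sum, and the difference of two sums is bounded by the number of positions in which the strings differ.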


4 Learning with Specialists 

In the previous section, the learning problem was divided amongst sensors by first sampling
i.i.d. data according to the underlying unknown probability distribution; the data
was then distributed amongst the sensors. In this section, we present a second model
where sensors are first assigned random, local subsets of the observation space. Then,
the sensors become specialized in these regions by observing i.i.d. training examples
drawn according to the underlying distribution, but constrained to fall within the sensor’s
local region of specialization. As before, the sensors are constrained in the way they
can communicate. With this new sampling process, we could naturally consider the four
cases corresponding to classification and regression in communication models with and
without abstention. However, to illustrate this model and to understand the underlying
fundamentals, we will consider only the case of classification with abstention. Other cases
can be considered similarly. Thus, when the network observes a new observation, the
sensors can respond with up to one bit of information, retaining the flexibility to abstain.
The fusion center combines this information to make a prediction. 

Though the distinction between the models may appear subtle, the difference is
essential and is motivated by a different set of sensor network applications and notions
of being distributed. The model is perhaps most relevant in applications such as field
estimation or other scenarios where a dimension of the observation space describes the
position of a sensor; here expertise, rather than data, is distributed. Random assignment
of regions of specialization can model random dispersal of sensors about an environment.
Our model will be posed more generally as this allows us to understand the fundamental
differences between this model and its classical counterpart.


4.1 The Model & Main Result

Consider a special case of a general formulation in [?]. Let $\mathcal{X} = [0,1]^d$ and $\mathcal{Y} = \{0,1\}$.
Suppose $n$ sensors are randomly assigned subsets of $\mathcal{X}$ in which to specialize; i.e., let
$\{\Theta_{ni}\}_{i=1}^n$ be a collection of i.i.d. random variables uniformly distributed across $\mathcal{X}$ so that
$B_{r_n}(\Theta_{ni}) = \{x \in \mathcal{X} : \|x - \Theta_{ni}\|_2 \le r_n\}$ is the local region of specialization for sensor $i$. Each
sensor samples a training datum according to the distribution $\mathbf{P}\{X, Y \mid X \in B_{r_n}(\Theta_{ni})\}$.
That is, sensor $i$ receives one labeled training example $(X_i, Y_i)$ distributed according to
$P_{XY}$, conditioned on $X_i$ being in the sensor’s region of specialization, $B_{r_n}(\Theta_{ni})$.

After training, the fusion center observes X ~ Px and broadcasts it to the network 
in a request for information; sensors respond with one bit according to the following rule: 


$$\delta_{ni}(x) = \begin{cases} Y_i, & \text{if } x \in B_{r_n}(\Theta_{ni}) \\ \text{abstain}, & \text{otherwise} \end{cases} \qquad (12)$$

That is, each sensor responds with one bit according to its training data label as long
as the new observation $X$ falls within its region of specialization. Otherwise, it does not
respond.

A fusion center combines this information with a majority vote: 


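$$g_n(x) = \begin{cases} 1, & \text{if } \sum_{i=1}^n \delta_{ni}(x)\, \mathbf{1}_{\{\delta_{ni}(x) \neq \text{abstain}\}} \ge \frac{1}{2} \Delta(x) \\ 0, & \text{otherwise} \end{cases} \qquad (13)$$

with $\Delta(x) = \sum_{i=1}^n \mathbf{1}_{\{\delta_{ni}(x) \neq \text{abstain}\}}$.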

For example, if $d = 2$, we might regard $\mathcal{X}$ as a cityscape so that each $x \in \mathcal{X}$ is a
location and $y \in \mathcal{Y}$ is a binary label describing whether a toxin is present. With this
example, $X$ is a random location of interest to an analyst and $Y$ is a random realization
of the toxicity; the sensors randomly deployed across $\mathcal{X}$ form a sensor network that learns
toxicity as a function of position.

The overriding question is whether this network can be designed to be universally
consistent. Let $L_n = \mathbf{P}\{g_n(X) \neq Y \mid \{\Theta_{ni}, X_i, Y_i\}_{i=1}^n\}$ denote the probability of
error of $g_n$ conditioned on random sensor specialization and training. Can the network
choose $r_n$ such that $\mathbf{E}\{L_n\} \to L^*$ universally? This question is answered in the following
proposition.

Proposition 4 If $r_n \to 0$ and $(r_n)^d n \to \infty$, then $\mathbf{E}\{L_n\} \to L^*$ for all distributions $P_{XY}$.

Though the proposition resembles classic results for the universal consistency of kernel
classifiers [?], the key difference is the process from which training data is sampled. In
traditional models, the sampling process by which training data is received is independent
and identically distributed according to the underlying unknown probability distribution.
In the current model, the training examples, though i.i.d., are generated from a distribution
which first depends on a random allocation of sensors in the observation space. This
distribution is in general different from the underlying distribution $P_{XY}$ and moreover,
it evolves with $n$ as the sensors grow dense in the observation space. Though this
difference may appear to be technical, it is fundamental, arises from the study of distributed
learning, and precludes applying previously known results in nonparametrics. In [?], a
more general formulation results in a theorem that quantifies the difference through a
universal relationship between the random specialization of the sensors and probability
distributions defined on compact observation spaces. The proof is of a similar style to
classical consistency results, taking into account such differences.
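
The sampling process of the specialist model can be sketched for $d = 1$ as follows. Each sensor first receives a uniformly random centre and then draws a single labeled example from the underlying distribution conditioned on the feature landing in its region, implemented here by rejection sampling. The toy distribution, the schedule `r_n = n ** (-1 / (2 * d))`, the tie-breaking toward label 1, the prediction of 0 when no sensor responds, and all names are assumptions made for this illustration.

```python
import random

rng = random.Random(4)

def draw_xy():
    """Toy P_XY (assumed): X uniform on [0,1], noisy threshold label."""
    x = rng.random()
    p1 = 0.9 if x > 0.5 else 0.1    # P{Y = 1 | X = x}; Bayes risk is 0.1
    return x, (1 if rng.random() < p1 else 0)

def train_specialist(r_n):
    """Assign a uniformly random centre, then draw one example from P_XY
    conditioned on X falling in the sensor's region (rejection sampling)."""
    theta = rng.random()
    while True:
        x_i, y_i = draw_xy()
        if abs(x_i - theta) <= r_n:
            return theta, y_i

def sensor_vote(x, theta, y_i, r_n):
    """Send the stored label if the query lies in the sensor's region of
    specialization; otherwise abstain (encoded as None)."""
    return y_i if abs(x - theta) <= r_n else None

def fuse_majority(votes):
    """Majority vote among responding sensors; predicting 0 when nobody
    responds is an assumed convention."""
    cast = [v for v in votes if v is not None]
    if not cast:
        return 0
    return 1 if sum(cast) >= 0.5 * len(cast) else 0

d, n = 1, 2000
r_n = n ** (-1.0 / (2 * d))    # r_n -> 0 while (r_n)^d * n = sqrt(n) -> infinity
sensors = [train_specialist(r_n) for _ in range(n)]

queries = [draw_xy() for _ in range(300)]
errors = 0
for x, y in queries:
    votes = [sensor_vote(x, theta, y_i, r_n) for theta, y_i in sensors]
    errors += fuse_majority(votes) != y
error_rate = errors / len(queries)
```

Note that a sensor decides whether to respond by comparing the query with its centre `theta`, not with its training feature, so the label it reports may come from a point up to `2 * r_n` away from the query. This is one concrete way in which the sampling process differs from the classical i.i.d. setting.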


5 Conclusions 

In this paper, we have described several models for distributed learning within communication
constraints, and we have discussed issues of consistency within them. These
models are of particular interest for potential applications in wireless sensor networks, 
given the intent of and constraints on these networks. It is anticipated that further 
research will lead to such applications. 